Skip to content

[CI] Build the test image once and pull it from ECR#5905

Open
hujc7 wants to merge 3 commits into
isaac-sim:mainfrom
hujc7:jichuanh/main-ci-ecr-build-once
Open

[CI] Build the test image once and pull it from ECR#5905
hujc7 wants to merge 3 commits into
isaac-sim:mainfrom
hujc7:jichuanh/main-ci-ecr-build-once

Conversation

@hujc7

@hujc7 hujc7 commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

1. Summary

  • main's build.yml rebuilds the test Docker image independently in both test-isaaclab-tasks and test-general (~23 min each, in parallel, racing the shared GHA layer cache). This builds the same image twice every run.
  • This change builds the image once in a dedicated build job that pushes it to ECR, then has the two test jobs depend on it and pull the prebuilt image instead of rebuilding.
  • Adopts the ecr-build-push-pull action already used by develop/release on the same self-hosted runner pool; it resolves the ECR URL from the runner's SSM ecr-cache-url parameter and falls back to a local build when ECR is unavailable, so no runner regresses.

2. Changes

  1. New .github/actions/ecr-build-push-pull/action.yml — generic drop-in for docker-build with ECR-backed caching (ported verbatim from develop). Builds + pushes the commit-tagged image, or pulls it if it already exists in ECR.
  2. .github/workflows/build.yml:
    • Added a build job that builds and pushes the image to ECR once.
    • test-isaaclab-tasks and test-general now needs: [build] and replace their per-job Build Docker Image step with an ECR pull of the same image-tag.
    • run-tests, result copy/upload, fork-result checks, and combine-results are unchanged. Job graph: build → {test-isaaclab-tasks, test-general} → combine-results.

3. Adapted to main (not a develop bulk-copy)

  • Keeps main's two-job test structure, env vars, run-tests/combine-results, and the existing image (docker/Dockerfile.curobo) — this only changes how many times the image is built, not what is built.
  • Uses cache-tag: cache-main-curobo so main's layer cache does not collide with develop/beta2's namespace in the shared ECR repo.

4. Expected effect

  • The duplicate ~23-min build collapses to one build + a ~1–2 min pull per test job, and the two jobs no longer race the GHA cache.

5. Test plan

  • build.yml and the new action validate as YAML; job dependency graph is build → test jobs → combine-results.
  • Pre-commit clean.
  • CI run (this PR): the build job pushes once; test-isaaclab-tasks/test-general pull the prebuilt image (no rebuild). Draft until that is observed green.

Note: targets main deliberately — develop already uses this build-once/ECR pattern; this brings main's build.yml in line.

build.yml rebuilt the test image independently in both test-isaaclab-tasks
and test-general (~23 min each, run in parallel and racing the GHA layer
cache). Add a single build job that builds the image and pushes it to ECR
via the ecr-build-push-pull action (ported from develop), and have the test
jobs depend on it and pull the prebuilt image instead of rebuilding.

The action resolves the ECR URL from the runner's SSM ecr-cache-url
parameter and falls back to a local build when ECR is unavailable, so
runners without ECR keep working as before. Preserves main's image
(docker/Dockerfile.curobo), test execution, and result handling.

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot — PR #5905

Summary

This PR eliminates a redundant ~23-min Docker image build by introducing a dedicated build job that pushes to ECR once, then has both test-isaaclab-tasks and test-general pull the prebuilt image. A new composite action (.github/actions/ecr-build-push-pull/action.yml) encapsulates the build/push/pull logic with a multi-level caching strategy (exact commit tag → deps hash → full build).


🧪 Isaac Lab Expert

Architecture: ✅ Correct build-once/pull pattern

  • The job dependency graph (build → {test-isaaclab-tasks, test-general} → combine-results) ensures the image is always available before test jobs start.
  • The cache-tag: cache-main-curobo correctly namespaces main's layer cache away from develop/beta2.
  • The action is ported from the already-proven develop workflow — good reuse.

Race condition risk: ✅ None identified

  • The needs: [build] dependency in both test jobs provides a hard barrier — no test job starts until the build job succeeds.
  • The previous approach had both test jobs building in parallel racing the GHA layer cache; this eliminates that race entirely.

Runner affinity consideration: ⚠️ Minor

  • The build job and test jobs all use [self-hosted, gpu] runners. If the test jobs land on a different runner than the build job, they will pull from ECR (not local cache). This is by design and the action handles it correctly via docker manifest inspect → pull → tag. However, if the ECR push in the build job fails silently (see below), the test jobs would fall back to a full local build — which is actually the correct degraded behavior.

🔇 Silent Failure Hunter

1. ECR push failure in build job (Step 7) — Medium risk

  • If docker push fails in the build job's "Push to ECR" step, it will cause the build job to fail (non-zero exit), which would block test jobs entirely due to needs: [build].
  • ✅ This is correct behavior — a hard failure is better than a silent one.

2. ECR unavailability in test jobs — ✅ Safe

  • When test jobs invoke the action and ECR is unavailable, the action resolves no ECR URL and falls back to a full local build (Step 6). The || true guards on IMDS/SSM calls prevent hard failures during resolution.
  • However, since the build job already pushed the image, the test jobs should always find it via the commit-tagged ECR image (Step 4 manifest check → pull).

3. Docker credential store bypass — ✅ Properly handled

  • Step 1 creates a temp config with credsStore: "" and cleans it up in Step 9 (if: always()).

4. Potential gap: docker buildx imagetools create failure in deps-cache step (Step 5)

  • If docker buildx imagetools create -t "${ECR_IMAGE}" "${DEPS_ECR_IMAGE}" fails, the step would exit non-zero and the build job would fail. This is safe (fails loud).

5. || true patterns in resolution steps — Acceptable

  • The || true guards on IMDS token, instance ID, region, and SSM are intentional — they allow graceful degradation on non-EC2 runners.

📊 Test Coverage Analyzer

  • No new tests needed. This is a CI-infrastructure-only change. The existing CI run of this PR serves as the integration test (build → pull → test pattern).
  • The test plan in the PR description is appropriate: validate YAML, observe the green CI run.

🏁 Verdict: LGTM

Well-structured CI optimization that follows the proven pattern from develop. The build-once/pull architecture is correct, race conditions are eliminated by job dependencies, and silent failures are properly handled with graceful degradation.

Minor suggestions (non-blocking):

  1. Consider adding a brief comment in build.yml noting that ecr-build-push-pull in test jobs will effectively just pull (since build already pushed), to aid future maintainers reading the workflow.
  2. The timeout-minutes: 180 on the build job matches the test jobs — this is generous for a single build. A tighter timeout (e.g., 60 min) could catch hangs earlier, though this is cosmetic given the existing test job timeouts.

Reviewed at commit: 7c22ca8


Update (c3c195c): New push is a merge from main into the feature branch (nightly-changelog workflow improvements, daqp bump to 0.8.5, version bump to 0.54.4). The PR's own files (.github/actions/ecr-build-push-pull/action.yml and .github/workflows/build.yml) are unchanged. Previous review still applies — LGTM holds.

Re-validated at: c3c195c | 2026-06-02T18:14 UTC


Update (49388a8): Two targeted improvements to the composite action:

  1. 🔒 Security fix (NGC credential leak): Changed docker login -p $NGC_API_KEY to pipe the key via --password-stdin, preventing the secret from appearing in /proc/<pid>/cmdline on shared self-hosted runners. Good catch — this was a real exposure vector on multi-tenant machines.

  2. 📋 Observability improvement (cache fallback logging): Added explicit log messages when ECR cache is available vs unavailable. The else branch now logs 🟠 ECR unavailable: full local build with no shared cache (build-once degrades to build-per-job) — making degradation immediately visible in CI output rather than requiring log archaeology.

Both changes are non-functional (no logic change), improve security posture and debuggability. LGTM still holds. ✅

Re-validated at: 49388a8 | 2026-06-05T21:28 UTC

@hujc7 hujc7 marked this pull request as ready for review June 5, 2026 20:38
@greptile-apps

greptile-apps Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR eliminates the duplicate ~23-minute Docker image build that previously ran independently in both test-isaaclab-tasks and test-general by introducing a dedicated build job that builds and pushes the image to ECR once, then having both test jobs pull it.

  • A new composite action ecr-build-push-pull implements a three-tier cache strategy: (1) exact per-commit ECR tag, (2) a content-hash-based deps tag, and (3) full docker buildx build with ECR layer cache — with a local-only fallback when ECR is unreachable.
  • build.yml gains a build job at the head of the graph (build → {test-isaaclab-tasks, test-general} → combine-results); the per-job docker-build steps are replaced by calls to the new action that will hit the ECR pull path after the build job completes.
  • The cache-tag: cache-main-curobo namespace keeps main's ECR layer cache isolated from develop/beta2.

Confidence Score: 4/5

Safe to merge — the workflow change is well-structured and the fallback path is sound; the only noted issue is a minor docker credential hygiene concern.

The build job correctly gates both test jobs via needs, so test jobs never run against a missing image. The three-tier ECR caching logic handles all cache states correctly, and the fallback to local-only build when ECR is unavailable preserves the pre-PR behavior. The one finding worth fixing is the NGC API key being passed as a -p flag to docker login rather than via --password-stdin, which briefly exposes it in the process table on the self-hosted runner.

.github/actions/ecr-build-push-pull/action.yml — docker login credential handling on line 66; otherwise the workflow changes in build.yml are straightforward.

Important Files Changed

Filename Overview
.github/actions/ecr-build-push-pull/action.yml New composite action implementing a multi-tier ECR caching strategy (exact commit tag → deps hash → full build); minor: docker login uses -p flag instead of --password-stdin
.github/workflows/build.yml Adds a dedicated build job that pushes to ECR once; test jobs now needs: [build] and replace docker-build with ecr-build-push-pull to pull the prebuilt image — job graph and environment variables are unchanged

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[build job] --> B{resolve-ecr}
    B -- ECR available --> C{exact commit tag in ECR?}
    C -- hit --> D[Pull exact image from ECR]
    C -- miss --> E{deps hash in ECR?}
    E -- hit --> F[imagetools create alias commit tag]
    E -- miss --> G[Full docker buildx build]
    G --> H[Push commit tag to ECR]
    H --> I[Push deps tag to ECR]
    B -- ECR unavailable --> G2[Full local build, no push]
    D --> J[build job complete]
    F --> J
    H --> J
    G2 --> J
    J --> K[test-isaaclab-tasks]
    J --> L[test-general]
    K --> K1[ecr-build-push-pull: pulls or rebuilds]
    L --> L1[ecr-build-push-pull: pulls or rebuilds]
    K1 --> M[run-tests]
    L1 --> N[run-tests]
    M --> O[combine-results]
    N --> O
Loading

Reviews (1): Last reviewed commit: "Merge branch 'main' into jichuanh/main-c..." | Re-trigger Greptile


if [ -n "${{ env.NGC_API_KEY }}" ]; then
echo "🔵 Logging into nvcr.io..."
docker login -u \$oauthtoken -p ${{ env.NGC_API_KEY }} nvcr.io

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Passing the NGC API key as a -p flag exposes the secret value in the process argument list (/proc/<pid>/cmdline) for the duration of the docker login call. Any process on the same self-hosted runner with read access to /proc can capture it. Piping via --password-stdin avoids this entirely.

Suggested change
docker login -u \$oauthtoken -p ${{ env.NGC_API_KEY }} nvcr.io
echo "${{ env.NGC_API_KEY }}" | docker login -u \$oauthtoken --password-stdin nvcr.io

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +253 to +254
- name: Full build
if: steps.pull-exact.outputs.hit != 'true' && steps.deps-cache.outputs.deps-cache-hit != 'true'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Full build runs in test jobs when ECR unavailable

When resolve-ecr resolves no ECR URL, both pull-exact and deps-cache are skipped (their if conditions gate on resolve-ecr.outputs.available == 'true'). Because neither step ran, their outputs remain unset (not 'true'), so the Full build condition passes and each test job performs an independent full build from scratch — which is the correct fallback, but means the build-once goal silently degrades to build-three-times when ECR is unavailable. The PR description mentions this explicitly, so it is intentional; consider adding a log line in the action to make this fallback observable in the CI log.

Pass the NGC API key to docker login via --password-stdin instead of
-p so it never appears in the process argument list on the shared
self-hosted runner.

Log a clear line in the full-build step when ECR is unavailable, so the
degradation from build-once to build-per-job is visible in CI logs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant